Code
import pandas as pd
df = pd.read_parquet("data/eda.parquet")import pandas as pd
skills_data = {
"Name": ["Binderiya", "Pranjul", "Pratham", "Panyang"],
"Python": [4, 4, 5, 3],
"SQL": [4, 4, 5, 4],
"Machine Learning": [2, 3, 2, 2],
"PySpark": [3, 3, 3, 3],
"Excel": [4, 5, 5, 4],
"Data Visualization": [5, 5, 3, 3],
"Power Bi/ Tableau": [4, 5, 3, 4],
"Version Control Git": [4, 4, 3, 3],
"ETL/Data pipeline": [3, 2, 1, 2],
"Communication": [4, 4, 5, 3],
"Project Management": [5, 5, 5, 3],
"Cloud Computing": [4, 4, 2, 2]
}
df_skills = pd.DataFrame(skills_data)
df_skills.set_index("Name", inplace=True)
df_skills| Python | SQL | Machine Learning | PySpark | Excel | Data Visualization | Power Bi/ Tableau | Version Control Git | ETL/Data pipeline | Communication | Project Management | Cloud Computing | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Name | ||||||||||||
| Binderiya | 4 | 4 | 2 | 3 | 4 | 5 | 4 | 4 | 3 | 4 | 5 | 4 |
| Pranjul | 4 | 4 | 3 | 3 | 5 | 5 | 5 | 4 | 2 | 4 | 5 | 4 |
| Pratham | 5 | 5 | 2 | 3 | 5 | 3 | 3 | 3 | 1 | 5 | 5 | 2 |
| Panyang | 3 | 4 | 2 | 3 | 4 | 3 | 4 | 3 | 2 | 3 | 3 | 2 |
import plotly.express as px
fig = px.imshow(
df_skills,
text_auto=True,
color_continuous_scale="YlGnBu",
aspect="auto"
)
fig.update_layout(
title="Team Skill Levels Heatmap",
xaxis_title="Skills",
yaxis_title="Team Members",
width=700,
height=400,
margin=dict(l=50, r=20, t=50, b=50)
)
fig.write_html(
"figures/skill_gap_plot1.html",
include_plotlyjs="cdn",
full_html=False
)import plotly.graph_objects as go
from IPython.display import IFrame
fig = go.Figure()
for name in df_skills.index:
values = df_skills.loc[name].tolist()
values += values[:1] # close the loop
fig.add_trace(go.Scatterpolar(
r=values,
theta=df_skills.columns.tolist() + [df_skills.columns[0]],
fill='toself',
name=name
))
fig.update_layout(
polar=dict(radialaxis=dict(visible=True, range=[0, 5])),
showlegend=True,
title='Team Skills Radar Chart'
)
fig.write_html(
"figures/skill_gap_plot2.html",
include_plotlyjs="cdn",
full_html=False
)From this radar chart visualization we can see that our team has a lot of room for improvement for skills like PySpark and Machine Learning. Also we can see that not a lot of our team mates are confident in their skills in Cloud Computing and ETL.
keywords = ['Data Analyst', 'Business Analyst', 'Data Engineering', 'Deep Learning',
'Data Science', 'Data Analysis','Data Analytics', 'Market Research Analyst'
'LLM', 'Language Model', 'NLP', 'Natural Language Processing',
'Computer Vision', 'Business Intelligence Analyst', 'Quantitative Analyst', 'Operations Analyst']
match = lambda col: df[col].str.contains('|'.join(keywords), case=False, na=False)
df['DATA_ANALYST_JOB'] = match('TITLE_NAME') \
| match('SKILLS_NAME') \
| match('SPECIALIZED_SKILLS_NAME')
df['DATA_ANALYST_JOB'].value_counts()DATA_ANALYST_JOB
False 38212
True 33042
Name: count, dtype: int64
import ast
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
# Safely apply literal_eval only to non-null values
df['SKILLS'] = df['SKILLS_NAME'].apply(lambda x: ast.literal_eval(x) if pd.notnull(x) else [])
data_skills = df[df['DATA_ANALYST_JOB']]['SKILLS'].explode().value_counts().reset_index()
data_skills.columns = ['Skill', 'Count']
fig = px.bar(data_skills, x='Skill', y='Count',
title="Top Skills",
labels={'Skill': 'Skill Name', 'Count': 'Frequency'},
color='Skill')
df_skills.index = df_skills.index.str.strip()from collections import defaultdict
# Lowercase everything
team_skills = [s.lower().strip() for s in df_skills.columns]
job_demand_raw = data_skills.copy()
job_demand_raw['Skill'] = job_demand_raw['Skill'].str.lower().str.strip()
# New dict to map cleaned team skill to total count from job postings
skill_demand_map = defaultdict(int)
for _, row in job_demand_raw.iterrows():
skill_in_posting = row['Skill']
count = row['Count']
for team_skill in team_skills:
if team_skill in skill_in_posting:
skill_demand_map[team_skill] += countteam_skills = [s.strip().lower() for s in df_skills.columns]
print("Team skills:", team_skills)
print(job_demand_raw['Skill'].head(10).tolist())
for skill_text in job_demand_raw['Skill'].head(10):
for team_skill in team_skills:
if team_skill in skill_text:
print(f" '{team_skill}' found in: '{skill_text}'")Team skills: ['python', 'sql', 'machine learning', 'pyspark', 'excel', 'data visualization', 'power bi/ tableau', 'version control git', 'etl/data pipeline', 'communication', 'project management', 'cloud computing']
['data analysis', 'sql (programming language)', 'communication', 'management', 'python (programming language)', 'tableau (business intelligence software)', 'dashboard', 'computer science', 'problem solving', 'power bi']
'sql' found in: 'sql (programming language)'
'communication' found in: 'communication'
'python' found in: 'python (programming language)'
job_demand = pd.Series(skill_demand_map)
job_demand.name = "Count"
team_avg = df_skills.mean()
team_avg.index = team_avg.index.str.strip().str.lower()
# Now match only overlapping skills
common_skills = job_demand.index.intersection(team_avg.index)
team_avg = team_avg[common_skills]
job_demand = job_demand[common_skills]
# Normalize job demand
job_demand_normalized = 5 * (job_demand / job_demand.max())
job_demand_normalized.name = "Job Demand (Normalized)"
# Combine
comparison_df = pd.concat([team_avg, job_demand_normalized], axis=1)
comparison_df.columns = ["Team Average Skill", "Job Demand (Normalized)"]
comparison_df["Skill Gap"] = comparison_df["Job Demand (Normalized)"] - comparison_df["Team Average Skill"]
comparison_df.sort_values("Skill Gap", ascending=False, inplace=True)
comparison_df| Team Average Skill | Job Demand (Normalized) | Skill Gap | |
|---|---|---|---|
| communication | 4.00 | 5.000000 | 1.000000 |
| sql | 4.25 | 4.959112 | 0.709112 |
| machine learning | 2.25 | 0.869894 | -1.380106 |
| python | 4.00 | 2.283191 | -1.716809 |
| data visualization | 4.00 | 1.520015 | -2.479985 |
| excel | 4.50 | 1.959357 | -2.540643 |
| cloud computing | 3.00 | 0.266590 | -2.733410 |
| pyspark | 3.00 | 0.104878 | -2.895122 |
| project management | 4.50 | 1.530850 | -2.969150 |
import plotly.express as px
fig = px.bar(
comparison_df,
x='Skill',
y='Skill Gap',
color='Skill Gap',
color_continuous_scale='RdBu_r',
title='Skill Gaps: Job Market Expectations vs. Team Capability',
labels={'Skill Gap': 'Gap (Job Demand - Team Skill)', 'Skill': 'Skill'},
)
fig.add_hline(y=0, line_dash='dash')
fig.update_layout(
xaxis_tickangle=-45,
yaxis_title='Gap (Positive = Market expects more)',
font=dict(size=13),
height=500,
plot_bgcolor='white',
)
fig.write_html(
"figures/skill_gap_plot3.html",
include_plotlyjs="cdn",
full_html=False
)Our skill gap analysis compares the team’s average skill levels with normalized job market expectations. The “Skill Gap” metric is calculated based on the frequency of each skill’s appearance in job postings, rather than a direct assessment of skill proficiency required by employers. As such, while skills like SQL and Python appear to have smaller gaps, this may partly reflect the fact that they are mentioned slightly less often in the postings relative to emerging areas like cloud computing or machine learning. Nevertheless, SQL and Python remain critical foundational skills that the job market consistently expects from candidates.
The results indicate that the team meets or slightly exceeds market expectations in communication and SQL. However, there are notable gaps in technical domains such as machine learning, cloud computing, PySpark, data visualization, and project management. These areas should be prioritized for upskilling to align better with market demand.
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import ast
df['SKILLS'] = df['SKILLS_NAME'].apply(lambda x: ast.literal_eval(x) if pd.notnull(x) else [])
all_skills = [skill.lower().strip() for skills in df['SKILLS'] for skill in skills]
text = " ".join(all_skills)
wordcloud = WordCloud(
width=800,
height=400,
background_color='white',
max_words=150,
colormap='plasma'
).generate(text)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.title('Word Cloud of Skills from Job Postings', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
# Step 6: (Optional) Save it
wordcloud.to_file("figures/skills_wordcloud.png") The word cloud highlights the most frequently mentioned skills in job postings related to data science, business analytics, and machine learning roles. Larger words like “programming language,” “problem solving,” “business intelligence,” “SQL programming,” and “data analysis” indicate skills that are highly sought after by employers. Other prominent terms such as “project management,” “Python programming,” and “Microsoft Excel” further emphasize the blend of technical, analytical, and project-oriented skills expected in the industry. This visualization captures how employers prioritize a balance between technical proficiency and business acumen.